Exploring the data

First of all, I want to check the downloaded data…

dim(df)
## [1] 1599   13

Check for missing values…

anyNA(df)
## [1] FALSE

Check for out-of-scale quality values…

any(df$quality > 10 | df$quality < 0)
## [1] FALSE

Check for negative values…

any(df < 0)
## [1] FALSE
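
The three checks above could also be bundled into one helper. This is only a minimal sketch: `check_wine_data` and the `toy` data frame are hypothetical names with illustrative values, not part of the original analysis. It uses the element-wise `|` so that every row is tested.

```r
# Toy stand-in for the wine data frame; the values are illustrative only.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2),
  pH      = c(3.51, 3.20, 3.26, 3.16),
  quality = c(5L, 5L, 6L, 7L)
)

# Hypothetical helper: returns a named logical vector, TRUE = check passed.
check_wine_data <- function(d, quality_range = c(0, 10)) {
  q <- d$quality
  c(
    no_missing       = !anyNA(d),
    # Element-wise `|` compares every row; `||` only accepts single values.
    quality_in_range = !any(q < quality_range[1] | q > quality_range[2]),
    no_negatives     = !any(d < 0, na.rm = TRUE)
  )
}

checks <- check_wine_data(toy)
print(checks)
```

Returning a named vector makes it easy to see which check failed instead of getting a single TRUE or FALSE.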

Once checked let’s start with the analysis:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

There are 1599 wine observations with 13 variables each (including the row index X), with the following distributions:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The general summary shows some interesting values to consider:

Given these extreme values, it will be interesting to look at the dispersion (standard deviation) of the chemical properties:

##        fixed.acidity     volatile.acidity          citric.acid 
##          1.741096318          0.179059704          0.194801137 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          1.409928060          0.047065302         10.460156970 
## total.sulfur.dioxide              density                   pH 
##         32.895324478          0.001887334          0.154386465 
##            sulphates              alcohol              quality 
##          0.169506980          1.065667582          0.807569440

The dispersion of total.sulfur.dioxide, free.sulfur.dioxide and residual.sugar is very high, as we suspected above.
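
A table like the one above can be obtained by applying `sd` to every column. The sketch below assumes a small `toy` data frame with illustrative values. It also adds the coefficient of variation (standard deviation divided by the mean) as one possible way to compare dispersion between variables measured on different scales; that ratio is an addition here, not something computed in the original analysis.

```r
# Toy stand-in with a few illustrative rows; not the real data set.
toy <- data.frame(
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 6.1),
  total.sulfur.dioxide = c(34, 67, 54, 60, 102),
  density              = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978)
)

# Standard deviation of every column, in each variable's own units.
std_dev <- sapply(toy, sd)

# Coefficient of variation: unit-free, so comparable across columns.
coef_var <- std_dev / sapply(toy, mean)

print(round(cbind(sd = std_dev, cv = coef_var), 4))
```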

Univariate Plots Section

Let’s look at the distribution of each variable:

And the normal Q-Q plot for each one:
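
As a rough sketch of how such plots can be drawn with base R graphics (the original plotting code is not shown, so the simulated `toy` data and the layout below are assumptions):

```r
set.seed(1)
# Simulated stand-ins: one roughly normal and one right-skewed variable.
toy <- data.frame(
  pH        = rnorm(200, mean = 3.3, sd = 0.15),
  chlorides = rlnorm(200, meanlog = log(0.08), sdlog = 0.4)
)

# Top row: histograms with a density line; bottom row: normal Q-Q plots.
op <- par(mfrow = c(2, ncol(toy)))
for (v in names(toy)) {
  hist(toy[[v]], freq = FALSE, main = v, xlab = v)
  lines(density(toy[[v]]))
}
for (v in names(toy)) {
  qqnorm(toy[[v]], main = paste("Normal Q-Q:", v))
  qqline(toy[[v]])
}
par(op)
```

In the Q-Q plots, points that bend away from the reference line at the upper end indicate a right-skewed variable.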

Looking at the plots, we can see that several properties are right-skewed, while a few, such as pH, density and quality, seem normally distributed, although the density line in their histograms does not show that clearly.

Let’s see what happens if we change all the variables to a logarithmic scale…

Interesting… we can see that the skewed variables fixed.acidity, volatile.acidity, chlorides and sulphates move towards a ‘more’ normal distribution. Let’s add them to the dataframe:

#adding transformed vars to the dataframe with the suffix '.log'
df$fixed.acidity.log <- log10(df$fixed.acidity)
df$volatile.acidity.log <- log10(df$volatile.acidity)
df$chlorides.log <- log10(df$chlorides)
df$sulphates.log <- log10(df$sulphates)

Drawing the normal Q-Q plots again with the new log variables:

It seems that the new variables follow a normal distribution more closely than the original ones.
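
One way to put a number on that visual impression is to compare the sample skewness before and after the transformation. This is a sketch on simulated right-skewed data; the `skewness` helper is defined here for illustration (base R does not provide one) and was not used in the original analysis.

```r
set.seed(42)
# Simulated right-skewed variable standing in for something like chlorides.
x <- rlnorm(1000, meanlog = -2.5, sdlog = 0.5)

# Simple moment-based sample skewness; values near 0 suggest symmetry.
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
print(c(raw = skew_raw, log10 = skew_log))
```

A skewness much closer to zero after `log10` is consistent with the straighter Q-Q plots.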

Given the importance of quality in this study, let’s take a closer look at its histogram.

Some insights about quality:

Let’s test the normality of quality:

## 
##  Shapiro-Wilk normality test
## 
## data:  df$quality
## W = 0.85759, p-value < 2.2e-16

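Well… looking at the results, the Shapiro-Wilk test rejects normality for quality (p-value < 2.2e-16), so we can NOT claim that it is normally distributed. This is to be expected for a discrete variable with so few distinct values and such a large sample. Still, considering the sample size and how closely the histogram follows the normal curve, I will treat quality as approximately normal. The Q-Q plot supports my decision, taking into account that quality is a discrete variable.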
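
To illustrate why a discrete score behaves this way, the sketch below runs the same test on simulated data: a normal variable rounded to whole numbers, with the same sample size. The mean and standard deviation are assumptions chosen to resemble quality; this is not the real data.

```r
set.seed(123)
# Simulated integer scores: a normal variable rounded to whole numbers.
scores <- round(rnorm(1599, mean = 5.6, sd = 0.8))

res <- shapiro.test(scores)
print(res$statistic)  # W statistic
print(res$p.value)    # very small, even though the underlying variable is normal
```

The rounding alone is enough for the test to reject normality at this sample size, which is why the histogram and the Q-Q plot are worth reading alongside the p-value.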

Univariate Analysis

What is the structure of your dataset?

See Exploring the Data

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality, as discussed above.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

At this point it is difficult to answer this question, but intuitively I would suggest alcohol, the acidity-related features and sulfur dioxide, although I am not sure which of the two, free.sulfur.dioxide or total.sulfur.dioxide. I need a correlation test to answer this question with more confidence, which I’ll perform in the next section.

Did you create any new variables from existing variables in the dataset?

Yes, I did. I explain why in the next question. Additionally, I’m thinking of splitting quality into bad, medium and good wines.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

As we can see in the histograms and normal Q-Q plots above, citric.acid is the feature with the most unusual distribution. residual.sugar and chlorides have right-skewed distributions with long, heavy tails, which suggests the existence of outliers.

I’ve applied a logarithmic transformation to fixed.acidity, volatile.acidity, chlorides and sulphates because of their right-skewed distributions; the transformation brings each variable closer to a normal distribution.

Bivariate Plots Section

Let’s start with the correlation matrix applied to the dataframe with the new variables:

##                      citric.acid residual.sugar free.sulfur.dioxide
## citric.acid           1.00000000     0.14357716       -0.0609781292
## residual.sugar        0.14357716     1.00000000        0.1870489951
## free.sulfur.dioxide  -0.06097813     0.18704900        1.0000000000
## total.sulfur.dioxide  0.03553302     0.20302788        0.6676664505
## density               0.36494718     0.35528337       -0.0219458312
## pH                   -0.54190414    -0.08565242        0.0703774985
## alcohol               0.10990325     0.04207544       -0.0694083536
## quality               0.22637251     0.01373164       -0.0506560572
## fixed.acidity.log     0.66716292     0.10927782       -0.1509186427
## volatile.acidity.log -0.56495716     0.01051618       -0.0001783224
## chlorides.log         0.18178017     0.10228456       -0.0021952453
## sulphates.log         0.33151619     0.01601568        0.0480522317
##                      total.sulfur.dioxide     density          pH
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
## fixed.acidity.log             -0.10460535  0.67477009 -0.70636020
## volatile.acidity.log           0.09281289  0.04542565  0.22311544
## chlorides.log                  0.05837562  0.35193852 -0.28362873
## sulphates.log                  0.01329585  0.16612354 -0.15411585
##                          alcohol     quality fixed.acidity.log
## citric.acid           0.10990325  0.22637251        0.66716292
## residual.sugar        0.04207544  0.01373164        0.10927782
## free.sulfur.dioxide  -0.06940835 -0.05065606       -0.15091864
## total.sulfur.dioxide -0.20565394 -0.18510029       -0.10460535
## density              -0.49617977 -0.17491923        0.67477009
## pH                    0.20563251 -0.05773139       -0.70636020
## alcohol               1.00000000  0.47616632       -0.09885158
## quality               0.47616632  1.00000000        0.11423756
## fixed.acidity.log    -0.09885158  0.11423756        1.00000000
## volatile.acidity.log -0.22862294 -0.39124918       -0.26393947
## chlorides.log        -0.30396099 -0.17613996        0.19892956
## sulphates.log         0.13515624  0.30864193        0.19790674
##                      volatile.acidity.log chlorides.log sulphates.log
## citric.acid                 -0.5649571634   0.181780174    0.33151619
## residual.sugar               0.0105161845   0.102284562    0.01601568
## free.sulfur.dioxide         -0.0001783224  -0.002195245    0.04805223
## total.sulfur.dioxide         0.0928128858   0.058375623    0.01329585
## density                      0.0454256502   0.351938519    0.16612354
## pH                           0.2231154407  -0.283628734   -0.15411585
## alcohol                     -0.2286229380  -0.303960993    0.13515624
## quality                     -0.3912491821  -0.176139965    0.30864193
## fixed.acidity.log           -0.2639394669   0.198929558    0.19790674
## volatile.acidity.log         1.0000000000   0.127885951   -0.29473814
## chlorides.log                0.1278859514   1.000000000    0.24307622
## sulphates.log               -0.2947381357   0.243076224    1.00000000

Looking at the correlations, the properties most correlated with quality are fixed.acidity.log, volatile.acidity.log, citric.acid, chlorides.log, total.sulfur.dioxide, density, sulphates.log and alcohol.
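
Reading one column out of a large matrix is easier if the correlations with quality are sorted by absolute value. The sketch below does this on simulated data; the `toy` data frame and its coefficients are assumptions for illustration only.

```r
set.seed(7)
n <- 300
# Simulated stand-ins; the relationships are made up for illustration.
alcohol              <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity.log <- rnorm(n, mean = -0.3, sd = 0.15)
residual.sugar       <- rnorm(n, mean = 2.5, sd = 1)  # unrelated to quality here
quality <- 5.6 + 0.4 * (alcohol - 10.4) -
  2 * (volatile.acidity.log + 0.3) + rnorm(n, sd = 0.6)
toy <- data.frame(alcohol, volatile.acidity.log, residual.sugar, quality)

# Pearson correlation matrix, then the quality column without itself.
cor_mat     <- cor(toy)
cor_quality <- cor_mat[, "quality"]
cor_quality <- cor_quality[names(cor_quality) != "quality"]

# Rank by strength of the relation, keeping the sign for interpretation.
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(round(ranked, 2))
```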

Let’s study the correlation with some boxplots:

Now I will create a new factor variable called quality.bucket, as I mentioned above, by cutting the original quality variable into bad (quality 4 or lower), medium (5 or 6) and good (7 or higher) wines:

# Create a new factor variable by cutting the original quality variable
dfCor$quality.bucket <- cut(dfCor$quality, breaks = c(0, 4, 6, 10), right = TRUE, labels = c("Bad", "Medium", "Good"))
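
To see which scores land in which bucket, the same `cut` call can be tried on the observed range of quality (3 to 8). With `right = TRUE` each interval includes its upper bound, so the breaks define (0, 4], (4, 6] and (6, 10]. This is a small sketch on a plain vector, not on the dataframe itself.

```r
# The quality scores that appear in the summary above (min 3, max 8).
quality <- 3:8

bucket <- cut(quality, breaks = c(0, 4, 6, 10), right = TRUE,
              labels = c("Bad", "Medium", "Good"))

# 3, 4 -> Bad; 5, 6 -> Medium; 7, 8 -> Good
print(data.frame(quality, bucket))
```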

Let’s see the boxplots now:
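
A boxplot per bucket can be sketched with base R as follows. The data are simulated, and the positive link between alcohol and quality is built in by assumption, so only the mechanics carry over to the real dataframe.

```r
set.seed(3)
# Simulated stand-in: alcohol rises with quality by construction.
quality <- sample(3:8, 200, replace = TRUE)
alcohol <- 8 + 0.4 * quality + rnorm(200, sd = 0.8)
toy <- data.frame(quality, alcohol)
toy$quality.bucket <- cut(toy$quality, breaks = c(0, 4, 6, 10),
                          labels = c("Bad", "Medium", "Good"))

# One box per bucket; the formula reads "alcohol grouped by quality.bucket".
boxplot(alcohol ~ quality.bucket, data = toy,
        xlab = "quality.bucket", ylab = "alcohol")

# The medians drawn inside the boxes, as numbers.
medians <- tapply(toy$alcohol, toy$quality.bucket, median)
print(round(medians, 2))
```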

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In this part of the study, I used the correlation matrix to select the few variables that seem most correlated with the quality variable (fixed.acidity.log, volatile.acidity.log, citric.acid, chlorides.log, total.sulfur.dioxide, density, sulphates.log and alcohol).

After that, I plotted each of the selected variables as a boxplot, which gave some interesting insights:

  • citric.acid, alcohol, fixed.acidity.log and sulphates.log have a positive relation with the quality of wine.
  • density, volatile.acidity.log and chlorides.log have a negative relation with the quality of wine.
  • total.sulfur.dioxide has no apparent relation.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are several interesting relations worth noting:

  • As expected, pH is strongly correlated with the acidity-related properties, such as citric.acid (negative) and fixed.acidity (negative): more acid means a lower pH.
  • density is correlated with citric.acid and residual.sugar, as well as with alcohol and fixed.acidity. The relation of density with residual.sugar (positive) and alcohol (negative) is expected, but not the one with citric.acid (positive) or fixed.acidity (positive).

What was the strongest relationship you found?

quality is most strongly correlated with alcohol (r ≈ 0.48), closely followed by volatile.acidity.log (r ≈ -0.39).

Multivariate Plots Section

Let’s see how these main features relate to each other, now grouped by quality.bucket:

Let’s take a closer look at how alcohol is related to the other main variables, grouped by quality.bucket, and plot the density curves:
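
As a minimal sketch of one such density plot in base R (the original plotting code is not shown, so the simulated `toy` data, the group means and the styling are all assumptions):

```r
set.seed(11)
# Simulated alcohol values per bucket; the group means are made up.
toy <- data.frame(
  quality.bucket = factor(rep(c("Bad", "Medium", "Good"), each = 100),
                          levels = c("Bad", "Medium", "Good")),
  alcohol = c(rnorm(100, 10.0, 0.8), rnorm(100, 10.3, 0.9),
              rnorm(100, 11.5, 1.0))
)

# One kernel density estimate per bucket.
dens <- lapply(split(toy$alcohol, toy$quality.bucket), density)

# Common axis limits so that all three curves fit in the same panel.
x_range <- range(sapply(dens, function(d) range(d$x)))
y_max   <- max(sapply(dens, function(d) max(d$y)))

plot(x_range, c(0, y_max), type = "n",
     xlab = "alcohol", ylab = "density", main = "alcohol by quality.bucket")
for (i in seq_along(dens)) lines(dens[[i]], col = i, lwd = 2)
legend("topright", legend = names(dens), col = seq_along(dens), lwd = 2)
```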

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Multivariate analysis let me observe the relations between the properties more closely, and some of the relations found in the Bivariate Analysis have been confirmed:

  • volatile.acidity and density have a negative correlation with quality and alcohol
  • alcohol has a positive correlation with quality, citric.acid and sulphates

Were there any interesting or surprising interactions between features?

Yes, there were. First, a higher amount of alcohol in a wine seems to go with higher quality; second, good wines tend to have more sulphates, which changed my idea that ‘the more preservatives a wine has, the worse its quality’.


Final Plots and Summary

For this part of the study, I will transform the log variables back to the original ones, so that the plots show values on the original scale.
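
Since the transformed variables were created with `log10`, raising 10 to their power recovers the original values. A short sketch with a few sulphates values taken from the structure listing above; the `.back` suffix is a hypothetical name used only here.

```r
# A few sulphates values as shown in the structure listing above.
sulphates <- c(0.56, 0.68, 0.65, 0.58)

sulphates.log  <- log10(sulphates)   # the transformation used earlier
sulphates.back <- 10^sulphates.log   # inverse of log10

# TRUE: the round trip recovers the original values (up to floating point).
print(isTRUE(all.equal(sulphates.back, sulphates)))
```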

Plot One

Description One

This plot is interesting because it brings together the main features that, according to my investigation, seem to have the most influence on wine quality.

  • Three of them are positively correlated with quality: citric.acid, sulphates.log and alcohol.
  • The remaining one is volatile.acidity, which is negatively correlated with quality.

Plot Two

Description Two

This is one of the most important conclusions drawn from my investigation. The quality of wine improves with higher levels of alcohol and lower volatile acidity.

Plot Three

Description Three

The last plot shows the second important conclusion: the quality of wine improves with higher levels of sulphates. I suppose that sulphates, being a good preservative, make the wine deteriorate more slowly than the more ‘natural’ ones. More sulphates and more alcohol, better wine.


Reflection

The exploratory data analysis done in this project, and throughout the course, has been very useful not only for learning to use R, but also for understanding the meaning of some statistical tools in real situations; in this case, the quality of red wines as a function of their chemical properties. Tools like scatterplots and density plots helped me present a long list of values in a simple and meaningful way.

The findings were surprisingly positive, like the positive relation between alcohol and quality, or the positive one with the amount of sulphates. Before this study I was completely mistaken, because I thought that a ‘bad’ wine had a higher % of alcohol.

We need to keep in mind that this is a small sample of red wines, and that the quality variable is an expert’s subjective rating, with all the implications that has… maybe a better solution would be to use the median rating of a group of experts.

Another consideration is that it would have been interesting to have the geolocation of each observation, so we could compare wine quality with the designation of origin and plot it on a map. The result could be interesting…

References